8 research outputs found

    Feature Selection Techniques and Classification Accuracy of Supervised Machine Learning in Text Mining

    Text mining is a special case of data mining that explores unstructured or semi-structured text documents to establish valuable patterns and rules that indicate trends and significant features about specific topics. Text mining has been applied in pattern recognition, predictive studies, sentiment analysis, and statistical theories across many areas, including research, medicine, financial analysis, social life analysis, and business intelligence. Text mining uses concepts from natural language processing and machine learning. Machine learning algorithms have been used and reported to give great results, but their performance is affected by factors such as the dataset domain, the number of classes, the length of the corpus, and the feature selection techniques used. Redundant attributes degrade the performance of a classification algorithm, but this can be mitigated by using different feature selection and dimensionality reduction techniques. Feature selection is a data preprocessing step that chooses a subset of the input variables while eliminating features with little or no predictive information. Feature selection techniques include Information Gain, Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), Mutual Information, and Chi-Square, and they can be applied through filter, wrapper, or embedded approaches. To get the most value from machine learning, the best algorithms must be paired with the right tools and processes. Little research has been done on the effect of feature selection techniques on classification accuracy, or on pairing these algorithms with the feature selection techniques that yield optimal results. In this research, a text classification experiment was conducted using an incident management dataset in which incidents were classified into their resolver groups. The Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naïve Bayes (NB), and Decision Tree (DT) machine learning algorithms were examined. A filtering approach was used for feature selection, different ranking indices were applied to obtain the optimal feature set, and the classification accuracy results were analyzed. The classification accuracies obtained using TF were 88% for SVM, 70% for NB, 79% for DT, and 55% for KNN, while Boolean weighting registered 90%, 83%, 82%, and 75% for SVM, NB, DT, and KNN respectively. TF-IDF gave 91%, 83%, 76%, and 56% for SVM, NB, DT, and KNN respectively. The results showed that algorithm performance is affected by the feature selection technique applied. SVM performed best, followed by DT, KNN, and finally NB. In conclusion, the presence of noisy data leads to poor learning performance and increases computational time. The classifiers performed differently depending on the feature selection technique applied. For optimal results, the best-performing classifier should be paired with the feature selection technique that yields the best feature subset for accurate classification performance.
    Keywords: Text Classification, Supervised Machine Learning, Feature Selection
    DOI: 10.7176/JIEA/9-3-06
    Publication date: May 31st 2019
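
    The pairing described above can be reproduced in outline with scikit-learn. The following is a minimal sketch, not the study's actual code: the incident texts, resolver-group labels, and the choice of LinearSVC for the SVM are placeholder assumptions, and a chi-square ranking stands in for the filter-based ranking indices mentioned in the abstract.

        # Minimal sketch: pair TF, Boolean and TF-IDF weighting with SVM, NB, DT and KNN,
        # with a chi-square filter ranking the features. Texts and labels are placeholders.
        from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
        from sklearn.feature_selection import SelectKBest, chi2
        from sklearn.pipeline import Pipeline
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score
        from sklearn.svm import LinearSVC
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.neighbors import KNeighborsClassifier

        texts = ["printer offline in finance wing", "replace faulty laptop battery",
                 "vpn connection keeps dropping", "wifi unreachable on third floor",
                 "mailbox quota exceeded for user", "cannot send email attachments"]
        labels = ["hardware", "hardware", "network", "network", "email", "email"]
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.5, stratify=labels, random_state=0)

        weightings = {"TF": CountVectorizer(),                  # raw term frequency
                      "Boolean": CountVectorizer(binary=True),  # term presence/absence
                      "TF-IDF": TfidfVectorizer()}
        classifiers = {"SVM": LinearSVC(), "NB": MultinomialNB(),
                       "DT": DecisionTreeClassifier(random_state=0),
                       "KNN": KNeighborsClassifier(n_neighbors=1)}  # tiny placeholder corpus

        for w_name, vec in weightings.items():
            for c_name, clf in classifiers.items():
                pipe = Pipeline([("vec", vec),
                                 ("rank", SelectKBest(chi2, k="all")),  # keep only top-k on real data
                                 ("clf", clf)])
                pipe.fit(X_train, y_train)
                print(f"{w_name:7s} + {c_name}: accuracy = "
                      f"{accuracy_score(y_test, pipe.predict(X_test)):.2f}")

    On the real incident corpus, the number of retained features and the classifier hyperparameters would need tuning; that tuning is where the reported accuracy differences between weighting schemes arise.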

    N-gram Based Text Categorization Method for Improved Data Mining

    Though naïve Bayes text classifiers are widely used because of their simplicity and effectiveness, techniques for improving the performance of these classifiers have rarely been studied. Naïve Bayes classifiers, which are widely used for text classification in machine learning, are based on the conditional probability of features belonging to a class, where the features are chosen by feature selection methods. However, their performance is often imperfect because they do not model text well, because of inappropriate feature selection, and because of some disadvantages of naïve Bayes itself. Sentiment classification, or text classification, is the act of taking a set of labeled text documents, learning a correlation between a document's contents and its corresponding labels, and then predicting the labels of a set of unlabeled test documents as well as possible. Text classification is also sometimes called text categorization. Text classification has many applications in natural language processing tasks such as e-mail filtering, intrusion detection systems, news filtering, prediction of user preferences, and organization of documents. The naïve Bayes model makes strong assumptions about the data: it assumes that words in a document are independent. This assumption is clearly violated in natural language text: there are various types of dependencies between words induced by the syntactic, semantic, pragmatic, and conversational structure of a text. Also, the particular form of the probabilistic model makes assumptions about the distribution of words in documents that are violated in practice. We address this problem and show that it can be solved by modeling text data differently using N-grams. N-gram based text categorization is a simple method based on statistical information about the usage of sequences of words. We conducted an experiment to demonstrate that our simple modification is able to improve the performance of naïve Bayes for text classification significantly.
    Keywords: Data Mining, Text Classification, Text Categorization, Naïve Bayes, N-Grams
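
    As a hedged illustration of the n-gram idea (not the paper's implementation), the sketch below contrasts a unigram naïve Bayes classifier with one that also uses word bigrams, which captures some local word order and so relaxes the independence assumption; the documents and labels are invented placeholders.

        # Minimal sketch: unigram Naive Bayes vs. Naive Bayes over unigrams + bigrams.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        # Placeholder labelled documents (e.g. a spam-filtering task).
        train_docs = ["free prize waiting claim now", "meeting agenda attached for review",
                      "win a free prize today", "please review the attached report",
                      "claim your prize now", "agenda for the project meeting"]
        train_labels = ["spam", "ham", "spam", "ham", "spam", "ham"]
        test_docs = ["claim a free prize", "attached is the meeting agenda"]

        for name, ngram_range in [("unigrams only", (1, 1)), ("unigrams + bigrams", (1, 2))]:
            model = make_pipeline(CountVectorizer(ngram_range=ngram_range), MultinomialNB())
            model.fit(train_docs, train_labels)
            print(name, "->", list(model.predict(test_docs)))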

    Content Based Image Retrieval Using Colour, Texture and KNN

    Image retrieval is increasingly becoming an interesting field of research as the images that users store and process keep growing both in number and size, especially in digital databases. The images are stored on the portable devices users have used to capture them. The aim of this research is to solve the issues users experience in retrieving digital images stored on their devices, ensuring that requested images are retrieved accurately from storage. The images are pre-processed to remove noise and refocused to enhance image content. The retrieval is content based (Content Based Image Retrieval), where images in a database are matched on the subject of the image. In this paper, the Corel image database is used with image pre-processing to ensure that image subjects are enhanced. Images are placed in classes and are retrieved based on the user's input. The Euclidean distance method is used to determine the nearest objects, thus resulting in the smallest number of images retrieved by the system. Colour and texture features are used to generate the feature matrices on which the image comparison is made. For the KNN algorithm, different values of K are tested to determine the best value for different classes of images. The performance of the design is compared to the MATLAB image retrieval system using the same image data set. The results obtained show that the combination of colour, texture, and KNN in image retrieval results in shorter computation time compared to the performance of the individual methods.
    Keywords: Image retrieval, KNN, clustering, image processing
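
    A minimal sketch of such a retrieval pipeline, assuming a colour histogram plus a crude gradient-based texture statistic as the feature vector; it is not the paper's MATLAB system, and the random arrays below merely stand in for the Corel images.

        # Minimal sketch: colour + texture features, Euclidean-distance KNN retrieval.
        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def extract_features(image):
            """Concatenate a per-channel colour histogram with a crude texture cue
            (spread of local gradients) into one feature vector."""
            hist = [np.histogram(image[..., c], bins=8, range=(0, 256))[0] for c in range(3)]
            colour = np.concatenate(hist).astype(float)
            colour /= colour.sum()                      # normalise so image size does not matter
            gray = image.mean(axis=2)
            texture = np.array([np.abs(np.diff(gray, axis=0)).std(),
                                np.abs(np.diff(gray, axis=1)).std()])
            return np.concatenate([colour, texture])

        rng = np.random.default_rng(0)
        database = [rng.integers(0, 256, size=(64, 64, 3)) for _ in range(20)]  # stand-in images
        features = np.array([extract_features(img) for img in database])

        # KNN retrieval with Euclidean distance; K would be tuned per image class.
        knn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(features)
        query = rng.integers(0, 256, size=(64, 64, 3))
        distances, indices = knn.kneighbors([extract_features(query)])
        print("Closest database images:", indices[0], "at distances", np.round(distances[0], 3))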

    Using Keystroke Dynamics and Location Verification Method for Mobile Banking Authentication.

    With the rise of security attacks on mobile phones, traditional authentication methods such as Personal Identification Numbers (PINs) and passwords are becoming ineffective due to limitations such as being easily forgotten, disclosed, lost, or stolen. Keystroke dynamics is a form of behavioral biometric authentication in which an analysis of how users type is monitored and used to authenticate them into a system. The use of location data provides a verification mechanism based on the user's location, which can be obtained via the phone's Global Positioning System (GPS) facility. This study evaluated existing authentication methods and summarized their performance. To address the limitations of traditional authentication methods, this paper proposed an alternative authentication method that uses keystroke dynamics and location data. To evaluate the proposed method, experiments were conducted using a prototype Android mobile banking application that captured typing behavior during login, along with location data, from 60 users. The experimental error rates were lower than those of the previous studies reviewed in this paper, with a False Rejection Rate (FRR) of 5.33%, the percentage of access attempts by legitimate users that were rejected by the system, and a False Acceptance Rate (FAR) of 3.33%, the percentage of access attempts by impostors that were incorrectly accepted by the system, giving an Equal Error Rate (EER) of 4.3%. The outcome of this study demonstrated keystroke dynamics and location verification on PINs as an alternative authentication method for mobile banking transactions, building on current smartphone features at lower implementation cost and with no additional hardware compared to other biometric methods.
    Keywords: smartphones, biometric, mobile banking, keystroke dynamics, location verification, security
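
    The error-rate analysis can be illustrated with a small sketch, using synthetic dwell and flight times rather than the study's data: a template is built from enrolment samples, new attempts are scored by Euclidean distance, and FAR/FRR are swept over thresholds to locate the EER. The location check (comparing the login's GPS coordinates with the user's usual locations) would be a second, independent verification and is omitted here.

        # Minimal sketch of keystroke-dynamics verification and FAR/FRR/EER estimation.
        import numpy as np

        rng = np.random.default_rng(1)

        def attempt(mean_dwell, mean_flight, n_keys=4):
            """One PIN entry: per-key hold (dwell) times and inter-key (flight) times."""
            dwell = rng.normal(mean_dwell, 0.02, n_keys)
            flight = rng.normal(mean_flight, 0.03, n_keys - 1)
            return np.concatenate([dwell, flight])

        genuine = np.array([attempt(0.12, 0.25) for _ in range(40)])   # legitimate user
        impostor = np.array([attempt(0.18, 0.35) for _ in range(40)])  # others typing the same PIN

        template = genuine[:20].mean(axis=0)          # enrolment template from first attempts
        gen_scores = np.linalg.norm(genuine[20:] - template, axis=1)
        imp_scores = np.linalg.norm(impostor - template, axis=1)

        best = None
        for threshold in np.linspace(0, 0.5, 501):
            frr = np.mean(gen_scores > threshold)     # genuine attempts rejected
            far = np.mean(imp_scores <= threshold)    # impostor attempts accepted
            if best is None or abs(far - frr) < abs(best[1] - best[2]):
                best = (threshold, far, frr)

        print(f"threshold={best[0]:.3f}  FAR={best[1]:.2%}  FRR={best[2]:.2%}  "
              f"EER~{(best[1] + best[2]) / 2:.2%}")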

    Enhancing Information Retrieval Relevance Using Touch Dynamics on Search Engine

    Using touch dynamics on a search engine is an attempt to establish the possibility of using user touch behavior, which is monitored so that several unique features can be extracted. The unique features are used for identifying users and their traits according to their touch dynamics. The results can be used to define a user's unique searching behavior automatically. Touch dynamics has been discussed in several studies in the context of user authentication and biometric identification for security purposes. This study establishes the possibility of integrating touch dynamics results for identifying user searching preferences and interests. It investigates a technique that combines personalized search with touch dynamics information as an approach for determining user preferences, interest measurement, and context.
    Keywords: Personalized Search, Information Retrieval, Touch Dynamics, Search Engine
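
    One purely hypothetical realisation of this idea, not taken from the study itself: touch features identify which enrolled user is interacting, and that user's interest profile then re-ranks the search results. All names, feature vectors, and scores below are invented for illustration.

        # Illustrative sketch: identify the user from touch features, then personalise ranking.
        import numpy as np

        # Enrolled users: mean touch feature vectors (e.g. pressure, touch duration, swipe speed).
        profiles = {
            "user_a": {"touch": np.array([0.42, 0.11, 0.80]), "interests": {"sports": 0.9, "finance": 0.1}},
            "user_b": {"touch": np.array([0.25, 0.20, 0.30]), "interests": {"sports": 0.2, "finance": 0.8}},
        }

        def identify(touch_sample):
            """Pick the enrolled user whose touch template is closest to the sample."""
            return min(profiles, key=lambda u: np.linalg.norm(profiles[u]["touch"] - touch_sample))

        def rerank(results, user):
            """Boost each result's base relevance by the identified user's topic interest."""
            interests = profiles[user]["interests"]
            return sorted(results, key=lambda r: r["score"] * (1 + interests.get(r["topic"], 0)), reverse=True)

        results = [{"title": "Premier League roundup", "topic": "sports", "score": 0.60},
                   {"title": "Interest rate outlook", "topic": "finance", "score": 0.65}]
        user = identify(np.array([0.40, 0.12, 0.75]))   # touch sample captured during the search session
        print(user, [r["title"] for r in rerank(results, user)])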

    A Model for Coronary Heart Disease Prediction Using Data Mining Classification Techniques

    Nowadays heart disease is one of the foremost causes of death in the world, so its early prediction and diagnosis are vital in the medical field and can facilitate timely treatment, decrease health costs, and reduce the deaths it causes. Treatment of the disease is not affordable for most patients, and clinical decisions are usually based on doctors' intuition and experience rather than on the knowledge-rich information hidden within the stored data. A model for heart disease prediction using data mining classification techniques reduces medical errors, decreases unwanted practice variation, enhances patient well-being, and improves patient outcomes. Such a model has been developed to support decision making in heart disease prediction based on data mining techniques. The experiments were performed using the model with three classification techniques, and their prediction accuracy was noted: decision tree, naïve Bayes, and KNN (K-Nearest Neighbors), implemented through the WEKA API (Waikato Environment for Knowledge Analysis application programming interface). The model predicts the likelihood of getting heart disease using several input medical attributes. Until now, 13 attributes, namely blood pressure, sex, age, cholesterol, and blood sugar, among other factors such as genetic factors, sedentary behavior, socio-economic status, and race, have been used to predict the likelihood of a patient getting heart disease. This study added two more attributes: obesity and smoking. 740 records with medical attributes were obtained from a publicly available heart disease database in a machine learning repository. With the help of these datasets, the patterns significant to heart attack prediction were extracted and divided into two data sets: one used for training, consisting of 296 records, and another for testing, consisting of 444 records. The accuracy of each data mining classification applied was used as the standard performance measure. Performance was compared by calculating the confusion matrix, which assists in finding the precision, recall, and accuracy. The complete system model provided high performance and accuracy, and a comparison of the prediction capability of the proposed techniques and the existing one was presented. The model assists clinicians in predicting the survival rate of an individual patient so that future medication can be planned; consequently, patients, their families, and relatives can plan for treatment preferences and budget accordingly.
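
    A minimal sketch of the evaluation workflow, using scikit-learn as a stand-in for the WEKA-based model: split the records into the reported 296 training and 444 testing sets, fit decision tree, naïve Bayes, and KNN classifiers, and derive accuracy, precision, and recall from the confusion matrix. The attribute matrix below is random placeholder data, not the actual heart disease records.

        # Minimal sketch of the train/test evaluation with confusion-matrix metrics.
        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.naive_bayes import GaussianNB
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(740, 15))     # 740 records, 15 attributes (13 + obesity, smoking)
        y = rng.integers(0, 2, size=740)   # 1 = heart disease present, 0 = absent

        # The abstract reports 296 training records and 444 testing records.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=296, test_size=444, random_state=0)

        models = {"Decision tree": DecisionTreeClassifier(random_state=0),
                  "Naive Bayes": GaussianNB(),
                  "KNN": KNeighborsClassifier(n_neighbors=5)}
        for name, model in models.items():
            y_pred = model.fit(X_train, y_train).predict(X_test)
            print(name, "confusion matrix:")
            print(confusion_matrix(y_test, y_pred))
            print(f"  accuracy={accuracy_score(y_test, y_pred):.2f} "
                  f"precision={precision_score(y_test, y_pred):.2f} "
                  f"recall={recall_score(y_test, y_pred):.2f}")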